IRDC-Net: Lightweight Semantic Segmentation Network Based on Monocular Camera for Mobile Robot Navigation

Computer vision plays a significant role in mobile robot navigation due to the wealth of information extracted from digital images. Mobile robots localize and move to the intended destination based on the captured images. Due to the complexity of the environment, obstacle avoidance still requires a complex sensor system with high computational efficiency. This study offers a real-time solution to the problem of extracting corridor scenes from a single image, using a lightweight semantic segmentation model integrated with a quantization technique to reduce the number of training parameters and the computational cost. The proposed model consists of MobileNetV2 as the encoder and an FCN as the decoder (with multi-scale fusion). This combination allows us to significantly minimize computation time while achieving high precision. Moreover, in this study, we also propose to use the Balanced Cross-Entropy loss function to handle diverse datasets, especially those with class imbalances, and to integrate a number of techniques, for example, the Adam optimizer and Gaussian filters, to enhance segmentation performance. The results demonstrate that our model outperforms baselines across different datasets. Moreover, when applied to practical experiments with a real mobile robot, the proposed model's performance remains consistent, supporting optimal path planning and allowing the mobile robot to efficiently and effectively avoid obstacles.


Introduction
Mobile robots (MRs) safely navigate their environments by recognizing obstacles in real time. MR navigation assistance systems detect obstacles through laser scanners [1], sensors [2], and cameras [3]. Navigation systems for complex environments are prohibitively expensive, as they require a considerable amount of computing power [2,3]. The use of Lidar or cameras has recently become widespread. Lidar automatically measures the distance to obstacles, detects an object's boundary regions, and maintains an MR's perception of the environment [4]. However, environmental conditions such as lighting, fog, or rain can negatively influence the process of collecting Lidar data.
Furthermore, many kinds of obstacles in indoor environments block or attenuate laser beams, making the representation of changing environments problematic [5]. To overcome the limitations of Lidar in indoor environments, Lidar can be combined with other sensors or camera systems to improve data collection [4-7]. Alternatively, cameras offer inexpensive scene data for detecting any object [8,9]. Due to the prevalence of affordable, high-precision monocular cameras, previously existing drawbacks have been eliminated. Thus, real-time image segmentation and MR path planning have been accomplished [8,10].
Semantic segmentation utilizing deep learning (DL) is a fundamental challenge in many vision-based applications [11-15], including scene interpretation and object detection.

IRDC-Net Architecture

The IRDC-Net architecture, which consists of the IR and DC blocks, is depicted in Figure 2. The first layer is a 1 × 1 convolution with the ReLU6 activation function. The second layer is a 3 × 3 depthwise convolution (DC), which reduces the number of parameters. The third layer is a 1 × 1 convolution without any activation function: the "ReLU" block is replaced by a "Linear" block. The architecture uses two residual blocks, with stride = 1 and stride = 2, to serve as intermediate layers. The difference between the IR block and the original residual block lies in the adjustment of the skip connection used in MobileNetV2: the IR block requires fewer input and output channels for each residual block (bottleneck layer) [28], as shown in Figure 3. The IR blocks in IRDC-Net compress the layers where the skip connections are attached. In contrast, the original residual blocks used in ResNet [29,30] have more input and output channels than the intermediate layers. In Figure 3c,d, linear bottlenecks and inverted residual blocks between bottlenecks are also recommended in addition to MobileNetV2's depth-separable structures.
As for the DC, instead of using a single kernel (filter) to conduct the convolution computation over the entire set of input channels, it employs a different kernel for each input channel. This allows us to reduce the number of parameters and computations, as we only need to compute the convolution on a single channel at a time rather than on all channels. A standard 1 × 1 convolution layer can then combine the features from the separate channels.
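For illustration, the following is a minimal tf.keras sketch of one inverted residual block with a depthwise convolution. The expansion width, batch normalization placement, and channel counts follow the common MobileNetV2 pattern and are assumptions rather than the exact IRDC-Net configuration.

```python
import tensorflow as tf
from tensorflow.keras import layers

def inverted_residual(x, expand_ch, out_ch, stride):
    """Inverted residual (IR) block with a depthwise convolution (DC),
    following the MobileNetV2-style pattern described above."""
    inp_ch = x.shape[-1]
    # 1) 1x1 expansion convolution with ReLU6
    y = layers.Conv2D(expand_ch, 1, use_bias=False)(x)
    y = layers.BatchNormalization()(y)
    y = layers.ReLU(max_value=6.0)(y)
    # 2) 3x3 depthwise convolution: one kernel per input channel
    y = layers.DepthwiseConv2D(3, strides=stride, padding="same",
                               use_bias=False)(y)
    y = layers.BatchNormalization()(y)
    y = layers.ReLU(max_value=6.0)(y)
    # 3) 1x1 linear projection (no activation: the "linear bottleneck")
    y = layers.Conv2D(out_ch, 1, use_bias=False)(y)
    y = layers.BatchNormalization()(y)
    # Skip connection only when stride = 1 and channel counts match
    if stride == 1 and inp_ch == out_ch:
        y = layers.Add()([x, y])
    return y
```

With stride = 1 and matching channel counts, the block adds the skip connection; with stride = 2, it downsamples and omits it, corresponding to the two residual block types described above.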
In the proposed semantic segmentation model, the decoder is constructed as an FCN head that fuses multi-scale features from the MobileNetV2 encoder.
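A minimal sketch of such a multi-scale fusion head is given below; the feature strides (1/8, 1/16, 1/32), upsampling factors, and use of bilinear interpolation are illustrative assumptions rather than the authors' exact decoder.

```python
import tensorflow as tf
from tensorflow.keras import layers

def fcn_decoder(f_low, f_mid, f_high, num_classes=2):
    """FCN-style multi-scale fusion head over encoder features taken at
    1/8, 1/16, and 1/32 of the input resolution (assumed strides)."""
    # Project every feature map to num_classes channels
    p_high = layers.Conv2D(num_classes, 1)(f_high)
    p_mid = layers.Conv2D(num_classes, 1)(f_mid)
    p_low = layers.Conv2D(num_classes, 1)(f_low)
    # Upsample and fuse coarse-to-fine, as in FCN-8s-style skip fusion
    x = layers.UpSampling2D(2, interpolation="bilinear")(p_high)
    x = layers.Add()([x, p_mid])
    x = layers.UpSampling2D(2, interpolation="bilinear")(x)
    x = layers.Add()([x, p_low])
    # Restore the input resolution and predict per-pixel classes
    x = layers.UpSampling2D(8, interpolation="bilinear")(x)
    return layers.Softmax()(x)
```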
Training Setup

Our experiments are carried out with the following configuration: Python 3.11.0; the TensorFlow 1.4 framework; and a computer with an 11th-generation Core i7 processor at 2.50 GHz, an Nvidia 2080 Ti graphics card with 12 GB of VRAM, 32 GB of RAM, and a 64-bit operating system.
According to the actual data of the TaQuangBuu library images, which update the self-collected images in [7,10], the ratio of path pixels to obstacle pixels is quite high. In our previous studies [7,10], the Binary Cross-Entropy loss function was used for the two classes (available and unavailable regions). However, this led to an imbalance in the data: with Binary Cross-Entropy as the loss function, the learning model tends to favor the class that appears more frequently in the data. Adding more instances of the less dominant class to the training data might potentially solve the problem. Therefore, we propose to use the Balanced Cross-Entropy (BCE) loss function [34], as in Equation (1):

$L_{BalancedCE} = -\beta \, y \log \hat{y} - (1 - \beta)(1 - y)\log(1 - \hat{y})$ (1)

where $\hat{y}$ is the class SoftMax probability, $y$ is the ground truth of the corresponding prediction, $\beta = 1 - \frac{\sum y}{H \times W}$, and $H \times W$ is the total number of pixels in the image. Furthermore, β is used for adjusting the numbers of false negatives and false positives: it reduces the number of false negatives when β > 1 and the number of false positives when β < 1.
The Balanced Cross-Entropy (BCE) loss function offers the following advantages:
- Unbalanced data processing: In binary classification problems, the BCE function addresses the issue of unbalanced data. It ensures that if the sample ratio between the two classes is unequal, the smaller sample will be considered more significant. This prevents the model from being biased towards the larger sample.
- Error balancing: The BCE function takes into account the error levels in both classes. This causes the model to strive to minimize the mean error for both classes, as opposed to concentrating excessively on the minority class.
- Increased accuracy: By managing unbalanced data and equalizing errors, the BCE function can improve model accuracy in binary classification problems. It serves to balance class-based decisions and minimizes the impact of minority information.
Additionally, the BCE function can be applied to multi-class image segmentation problems. The BCE function is maximally efficient with a variety of datasets, particularly unbalanced ones. Hence, path planning will function more effectively in a variety of indoor environments.
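A minimal sketch of Equation (1) as a training loss, assuming binary ground-truth masks of shape (batch, H, W, 1) with values in [0, 1] and per-image computation of β:

```python
import tensorflow as tf

def balanced_cross_entropy(y_true, y_pred, eps=1e-7):
    """Balanced Cross-Entropy (Equation (1)): beta is the fraction of
    negative pixels, so the minority (positive) class is up-weighted."""
    y_pred = tf.clip_by_value(y_pred, eps, 1.0 - eps)
    # beta = 1 - (sum of positive pixels) / (H * W), computed per image
    beta = 1.0 - tf.reduce_mean(y_true, axis=[1, 2, 3], keepdims=True)
    loss = -(beta * y_true * tf.math.log(y_pred)
             + (1.0 - beta) * (1.0 - y_true) * tf.math.log(1.0 - y_pred))
    return tf.reduce_mean(loss)
```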
To optimize the Balanced Cross-Entropy loss defined in Equation (1), the Adam optimizer [35] was used. The model was trained with a learning rate of 0.001 for 100 epochs. The dataset was pre-processed with Gaussian blur [36] (as defined in Equation (2)) and Gaussian noise [37] (as defined in Equation (3)) to ensure the quality of the raw images before passing them through the proposed segmentation model shown in Figure 5. These operations alter the image quality, but they create more generalized datasets, enhancing the segmentation model's quality.

The Gaussian blur is an image filtering technique that computes the transformation of each pixel in the image using a Gaussian function. In two dimensions, it is given by Equation (2):

$G(x, y) = \frac{1}{2\pi\sigma^{2}} e^{-\frac{x^{2} + y^{2}}{2\sigma^{2}}}$ (2)

where x is the horizontal distance from the origin, y is the vertical distance from the origin, and σ is the standard deviation of the Gaussian distribution. It is essential to observe that the origin of these axes is at the center (0, 0). This formula generates a two-dimensional surface whose contours are concentric circles with a Gaussian distribution outward from the center. In digital image processing, Gaussian noise is reduced using a spatial filter. An undesirable consequence may be the blurring of fine-scale image edges when smoothing an image, since fine details correspond to the blocked high frequencies.
The probability density function p of a Gaussian random variable z is given in Equation (3):

$p(z) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(z - \mu)^{2}}{2\sigma^{2}}}$ (3)

where z is the grey level, µ is the mean grey value, and σ is the standard deviation.
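A minimal pre-processing sketch combining both filters; the kernel size, σ, and noise standard deviation below are illustrative values, not the paper's settings:

```python
import cv2
import numpy as np

def augment(image, ksize=5, sigma=1.0, noise_std=10.0):
    """Pre-processing used before training: Gaussian blur (Equation (2))
    and additive Gaussian noise (Equation (3)) to generalize the dataset."""
    blurred = cv2.GaussianBlur(image, (ksize, ksize), sigma)
    noise = np.random.normal(0.0, noise_std, image.shape)  # mean mu = 0
    noisy = np.clip(blurred.astype(np.float32) + noise, 0, 255)
    return noisy.astype(np.uint8)
```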

Quantization
Quantization [38] is a technique that reduces the size and performance requirements of a machine-learning model by representing its parameters with reduced precision. Parameters such as the weights and biases of a neural network are typically represented with high-precision floating-point data during the training phase. However, this requires a large amount of storage and computational resources, which can be a challenge when deploying the model on devices with limited resources, such as mobile devices or embedded microcontrollers. By quantizing the model, in other words, representing the model's parameters as limited-precision integer or real-number data, the storage size can be reduced while the computational performance is increased. This can be achieved through quantization techniques such as weight quantization, activation quantization, or a combination of both. To minimize the model's size, the model was converted from FP32 (32-bit floating-point precision) to FP16 (16-bit floating-point precision) during the training experiment shown in Figure 6. The quantization process typically includes two main steps:
- Precision Calibration: During training, FP32 parameters and activations are converted to FP16. This optimization decreases latency and increases inference speed, at the expense of a slight reduction in model accuracy. In real-time recognition, accuracy and inference speed must sometimes be traded off.
- Layer and Tensor Fusion: Layers and tensors are merged to optimize GPU memory and bandwidth by merging nodes vertically, horizontally, or both. Vertical merging joins successive kernel operations, while horizontal merging combines layers with the same layer size and input but differing weights into a single layer.
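The paper reports converting the model from FP32 to FP16; one way to perform such a conversion is with the TensorFlow Lite converter, sketched below under the assumption of a Keras model. The file names are hypothetical, and the authors' exact conversion toolchain is not specified.

```python
import tensorflow as tf

# FP32 -> FP16 post-training quantization with the TFLite converter
model = tf.keras.models.load_model("irdc_net.h5")  # assumed file name
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.target_spec.supported_types = [tf.float16]  # FP16 weights
tflite_fp16 = converter.convert()
with open("irdc_net_fp16.tflite", "wb") as f:
    f.write(tflite_fp16)
```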

Quantitative Results
The proposed method was evaluated using three datasets: the Cityscapes dataset (5000 images) [31], the KITTI dataset (400 images) [32], and the Duckie dataset collected from Ducktown (1200 images) [33]. In addition, we collected a set of 1200 additional authentic images from the Ta Quang Buu library to enhance the dataset. Three metrics (Accuracy, Loss, and mIoU) were considered for the segmentation model's requirements, and many comparisons with previous methods were carried out.
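For reference, mIoU is computed per class as the intersection over union of the predicted and ground-truth masks, averaged over classes; a minimal sketch, assuming integer label maps:

```python
import numpy as np

def mean_iou(pred, gt, num_classes=2):
    """Mean Intersection-over-Union, the main metric reported below."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:
            ious.append(inter / union)
    return float(np.mean(ious))
```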
Based on the input images from Cityscapes shown in Figure 4a, the segmented images remained robust under different environmental conditions, as shown in Figure 7.
In addition, a number of baselines with a similar encoder architecture were considered for evaluation: DSSPN [39], SqueezeNAS [40], and SaGe [41]. During the training process, the parameters and activation functions were represented in FP32. Consequently, switching to FP16 reduces latency and substantially reduces the model's size. In fact, when converting to FP16, some weights are truncated due to the smaller range of FP16 compared to FP32, resulting in a modest but insignificant decrease in accuracy. Table 1 illustrates that our segmentation model achieves the highest mIoU among the FCN models on the same Cityscapes dataset. Furthermore, our lightweight segmentation model, despite using fewer training parameters, obtained a validated mIoU higher by 2 to 10 percent. The accuracy and performance of the proposed segmentation model ensure that the MR's frontal view can be constructed later.

Table 1. Validated mIoU of FCN-based models on the Cityscapes dataset.

Model                                 Validated mIoU
DSSPN [39]                            77.8%
SqueezeNAS [40]                       72.4%
SaGe [41]                             76.9%
IRDC-Net: Lightweight Segmentation    78.1%

We conducted a comparison between our proposed model and existing segmentation models using the KITTI dataset. Figure 4b shows the input images from the KITTI dataset, and Figure 8 demonstrates that the robust segmentation performance of the proposed method is still guaranteed. Moreover, comparisons with baselines such as SDNet [42], SFRSeg [43], and APMoE seg ROB [44] were also performed to confirm the positive performance of the proposed method. Table 2 presents the results of this comparison, showing that the proposed method outperforms the best-performing baseline, SDNet [42], by nearly 3% in terms of mIoU. Due to the expansion of our dataset with more challenging images and the extensive use of training data, our improved segmentation model using the quantization technique could acquire a more accurate representation of the environment and a faster training process. These achievements strongly support the construction of the MR's frontal view and, ultimately, the design of optimal path planning.

Table 2. Validated mIoU on the KITTI dataset.

Model                                 Validated mIoU
SDNet [42]                            79.62%
SFRSeg [43]                           77.91%
APMoE seg ROB [44]                    78.11%
IRDC-Net: Lightweight Segmentation    81.11%

Furthermore, we self-gathered real-time images, which proved somewhat challenging. Next, 1200 images were collected from Ducktown's dataset. The obtained experimental results have proven the feasibility of applying the proposed model in realistic environments. Under changing light conditions, our proposed model could correctly classify ground and non-ground regions, a circumstance that is typically challenging even for humans. This is because color-shifting training was undertaken, which enabled our network to operate effectively in low-light mode. Given the prevalence of corners and intersections in indoor environments, the following examples were more intuitive for MRs. We executed the final performance evaluation on a background image containing numerous objects. Our network accurately anticipated the ground limit under challenging conditions, proving the robustness of the segmented images, as depicted in Figure 9.

Table 3. Comparison with the previous binary segmentation model on the same dataset.

Model                                 Accuracy    Validated mIoU
Binary Segmentation FCN-VGG 16 [7]    97.1%       71.8%
IRDC-Net: Lightweight Segmentation    98.3%       74.2%

The proposed model's performance has been consolidated and compared with the previous segmentation model, FCN-VGG 16 [7], on the same dataset, as shown in Table 3. In this case, we expanded the dataset with more challenging images. Then, to train the model more effectively, data augmentation was used extensively, resulting in a more accurate representation of the environment and significant performance improvements.

Finally, the authors continuously self-collected more than 1200 images from the TaQuangBuu library. The input image size was 960 × 1280. The final network performance evaluation was executed on a background image with numerous obstacles and intersections, as shown in Figure 10.

Figure 11 illustrates the Accuracy, Loss, and mIoU diagrams for both the training and validation processes. The diagrams depict the performance metrics when the input images were taken from the four datasets: Cityscapes, KITTI, Ducktown, and the TaQuangBuu library, as shown in Figure 12.

After undergoing color-shifting training, IRDC-Net's performance is well maintained when changing from low-light to dark scenarios. Indoor environments are full of corners and crossroads, making the following scenario more relevant to mobile robots. In Figure 13, the ground boundary in these challenging scenarios is reliably predicted by our network. Twelve snapshots, (a) to (l), in Figure 13 illustrate the MR turning left to reach the goal point with the support of the local search algorithm.

Figure 13. The segmented images captured by the mobile robot camera.

Mobile Robot's Frontal View
Firstly, based on the camera's focal length, the point coordinates are converted into the image plane. The relationship between the image plane and the image coordinates is shown in Figure 14.
Then, the image coordinates can be rewritten in homogeneous coordinates, as in Equation (4):

$\tilde{p} = \begin{bmatrix} x & y & 1 \end{bmatrix}^{T}$ (4)

The transformation matrix projects the 3D camera coordinates $(X_C, Y_C, Z_C)$ onto the 2D image plane $(x, y)$, as in Equation (5):

$Z_{C}\begin{bmatrix} x \\ y \\ 1 \end{bmatrix} = \begin{bmatrix} f & 0 & 0 & 0 \\ 0 & f & 0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix}\begin{bmatrix} X_{C} \\ Y_{C} \\ Z_{C} \\ 1 \end{bmatrix}$ (5)

Obtaining the result of Equation (5) yields the point coordinates in the image plane: P = (x, y). Then, the transformation from the image plane to the pixel plane is carried out as shown in Figure 15. The affine transformation is a translation in the 2D plane from image-plane coordinates (x, y) to pixel-plane coordinates (u, v), with $(O_x, O_y)$ the image center, as in Equation (6):

$u = x + O_{x}, \quad v = y + O_{y}$ (6)

Next, the pinhole camera model is derived with the help of homogeneous coordinates and projective space. This model describes how a three-dimensional scene is mapped onto a two-dimensional picture, as in Equation (7):

$s\begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = \begin{bmatrix} f & 0 & O_{x} & 0 \\ 0 & f & O_{y} & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix}\begin{bmatrix} R_{3\times3} & t \\ 0 & 1 \end{bmatrix}\begin{bmatrix} X_{W} \\ Y_{W} \\ Z_{W} \\ 1 \end{bmatrix}$ (7)

One transformation matrix is the extrinsic camera matrix, defining the camera's position in the 3D environment; the other is the intrinsic camera matrix, converting the image plane (x, y) to the pixel plane (u, v).
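A small numeric sketch of Equations (4)-(7), projecting one camera-frame point onto the pixel plane; the focal length and image-center values below are illustrative, not calibrated values from the paper:

```python
import numpy as np

# Intrinsic matrix built from an assumed focal length and image center
f, Ox, Oy = 800.0, 640.0, 480.0
K = np.array([[f, 0, Ox],
              [0, f, Oy],
              [0, 0, 1.0]])
Pc = np.array([0.4, -0.1, 2.0])   # (Xc, Yc, Zc), a point in camera frame
uv1 = K @ (Pc / Pc[2])            # u = f*Xc/Zc + Ox, v = f*Yc/Zc + Oy
u, v = uv1[0], uv1[1]             # pixel-plane coordinates
```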
Moreover, in Figure 16, the authors use a homography transformation to correct the perspective distortion, setting up the pixel plane to plan the MR's path, in order to present the relationship between the world coordinates (W: 4 × 1) and the image plane (p: 3 × 1). Equation (8) describes the transformation:

$p = M_{int} \cdot M_{ext} \cdot W$ (8)

where $M_{int}$ is the matrix of intrinsic parameters (3 × 4) and $M_{ext}$ is the matrix of extrinsic parameters (4 × 4). Using the mappings between 3D object points (points stated in the object frame) and the projected 2D image points (points of the object seen in the image), we can determine the camera's orientation. As in Equation (5), treating the image as a ground surface Z = 0, with the camera pose [45] fixed to the MR (rotating only about the Z-axis), $M_{ext}$ is given in Equation (9):

$M_{ext} = \begin{bmatrix} r_{11} & r_{12} & r_{13} & t_{x} \\ r_{21} & r_{22} & r_{23} & t_{y} \\ r_{31} & r_{32} & r_{33} & t_{z} \\ 0 & 0 & 0 & 1 \end{bmatrix}$ (9)

For the planar surface Z = 0, the third column of the rotation drops out, and expression (9) can be rewritten as Equation (10):

$p = M_{int}\begin{bmatrix} r_{11} & r_{12} & t_{x} \\ r_{21} & r_{22} & t_{y} \\ r_{31} & r_{32} & t_{z} \\ 0 & 0 & 1 \end{bmatrix}\begin{bmatrix} X \\ Y \\ 1 \end{bmatrix} = H\begin{bmatrix} X \\ Y \\ 1 \end{bmatrix}$ (10)

Based on Equations (4)-(10), the mobile robot's frontal view is designed to plan the MR's path. All steps are summarized in Figure 16. Finally, the homography parameters H are estimated from the relation between perspective plane one (x, y) and perspective plane two (x', y') [7,10]:

$x' = \frac{h_{11}x + h_{12}y + h_{13}}{h_{31}x + h_{32}y + h_{33}}, \quad y' = \frac{h_{21}x + h_{22}y + h_{23}}{h_{31}x + h_{32}y + h_{33}}$

Thus, using two sets of four known points (x, y) and (x', y') to calculate the H matrix, any four points of the pixel plane in the MR's frontal view can be fully obtained, as in Figure 17. Furthermore, the homography transformation for the bird's-eye view of the MR is shown in Figure 18. Using the checkerboard in Figure 19a, the authors perform the perspective correction of the MR's frontal views (see Figure 19b). Based on the segmented image, the allowed moving areas are proportional to the ground coordinates. The positions of the MR and obstacles are then fully determined to design the path planning in Figure 19c,d.
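A minimal OpenCV sketch of this perspective correction, using four assumed ground-point correspondences; the coordinates and file name below are hypothetical, not the calibrated values from Figure 19:

```python
import cv2
import numpy as np

# Four points on the ground in the frontal view (src) and their desired
# bird's-eye-view locations (dst); both sets are illustrative values.
src = np.float32([[420, 310], [860, 310], [1180, 700], [120, 700]])
dst = np.float32([[300, 0], [980, 0], [980, 720], [300, 720]])

H = cv2.getPerspectiveTransform(src, dst)      # 3x3 homography matrix
frame = cv2.imread("frontal_view.png")         # assumed input image
birds_eye = cv2.warpPerspective(frame, H, (1280, 720))
```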

Practical Results
In comparison to a previous study [10], the same optimal MR strategy [10] was used. However, in this study, the proposed FCN-MobileNetV2 model was utilized to obtain segmented images, facilitating the construction of the available area for movement, as depicted in Figure 20. Furthermore, a dedicated local search algorithm was designed to increase the safety of obstacle avoidance when the MR successfully tracks the global path, as shown in Figure 21. After analyzing the new results and comparing them with those obtained in [7], we conclude that semantic segmentation is necessary when constructing the ground's frontal perspective; this enables planning the most efficient path for the MR. The practical experiments conducted in this study focused on enhancing methods for recognizing collision-free zones in local search areas based on the given global path.

Since the camera pose was fixed to the MR, a smooth trajectory would affect the performance of the proposed semantic segmentation. In other words, our proposed model ensures better results compared to the previous FCN-VGG 16 [7], with the model parameters given in Table 3. When tested on multiple datasets, our proposed model exhibited a remarkable quality enhancement, as shown in Figure 12. Table 4 presents the relationship between the change in steering angle and the accuracy of the proposed model, considering the fixed camera pose. It demonstrates that as the steering angle increases, the accuracy of the model decreases significantly. Therefore, our improved semantic segmentation ensures the smoothness of the MR's trajectory and maintains minimal changes in the steering angle, as depicted in Figure 22. Thus, a smooth trajectory with low steering-angle change improves the performance of our lightweight semantic segmentation FCN-MobileNetV2 during the MR's movement.

Conclusions
This paper proposes a real-time solution for extracting corridor scenes from a single image to support mobile robot navigation. Specifically, a lightweight semantic segmentation model that integrates a quantization technique is introduced to improve segmentation accuracy while achieving a low computational cost. The evaluation results are compared with recent methods to demonstrate the feasibility of the proposed method. Moreover, our proposed lightweight semantic segmentation model, FCN-MobileNetV2, is significantly better in terms of precision and computation time than the previous semantic segmentation model, FCN-VGG-16. The practical results show successful tracking of the mobile robot's path with a steering-angle change below 0.05 rad. Indeed, our proposed segmentation model is trained and updated from binary classes to multiple classes to accurately identify a wide variety of indoor obstacles. Therefore, in real situations, path planning will work better in a variety of indoor settings. In addition, the segmented results support the local search algorithm in mobile robot path planning. Finally, the safety and obstacle avoidance abilities of the MR are enhanced against static and dynamic obstacles in unknown environments.

Institutional Review Board Statement: The study was conducted in accordance with the Declaration of Helsinki, and approved by the Institutional Review Board (or Ethics Committee).